Chiang Rai
Representing the Under-Represented: Cultural and Core Capability Benchmarks for Developing Thai Large Language Models
Kim, Dahyun, Lee, Sukyung, Kim, Yungi, Rutherford, Attapol, Park, Chanjun
The rapid advancement of large language models (LLMs) has highlighted the need for robust evaluation frameworks that assess their core capabilities, such as reasoning, knowledge, and commonsense, leading to the inception of widely used benchmark suites such as the H6 benchmark. However, these benchmark suites are primarily built for English, and comparable suites are lacking for languages that are under-represented in LLM development, such as Thai. At the same time, developing LLMs for Thai should enhance cultural understanding as well as core capabilities. To address this dual challenge in Thai LLM research, we propose two key benchmarks: Thai-H6 and the Thai Cultural and Linguistic Intelligence Benchmark (ThaiCLI). Through a thorough evaluation of various LLMs with multilingual capabilities, we provide a comprehensive analysis of the proposed benchmarks and how they contribute to Thai LLM development. Furthermore, we will make both the datasets and evaluation code publicly available to encourage further research and development for Thai LLMs.
- Pacific Ocean > North Pacific Ocean > San Francisco Bay > Golden Gate (0.05)
- Asia > Thailand > Maha Sarakham > Maha Sarakham (0.04)
- Asia > South Korea > Busan > Busan (0.04)
AI "News" Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian
Puccetti, Giovanni, Rogers, Anna, Alzetta, Chiara, Dell'Orletta, Felice, Esuli, Andrea
Large Language Models (LLMs) are increasingly used as "content farm" models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real "content farm". We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts 'in the wild', while generating them is too easy. We highlight the urgency of more NLP research on this problem.
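Of the three detection methods the abstract names, the log-likelihood one can be sketched as a threshold on a text's mean token log-likelihood under a scoring model. This is only a minimal illustration, not the authors' implementation: the per-token probabilities are assumed to come from some language model, and the function names and the threshold value are hypothetical. It also makes the abstract's practicality concern concrete, since the method cannot run without access to token likelihoods.

```python
import math
from typing import Sequence


def mean_log_likelihood(token_probs: Sequence[float]) -> float:
    """Average natural-log probability the scoring model assigned to each token."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    return sum(math.log(p) for p in token_probs) / len(token_probs)


def looks_synthetic(token_probs: Sequence[float], threshold: float) -> bool:
    """Flag a text as likely synthetic when its tokens are, on average,
    more predictable to the scoring model than the threshold allows.

    The threshold is a hypothetical value that would have to be tuned
    on labelled human and synthetic texts.
    """
    return mean_log_likelihood(token_probs) > threshold


# Illustrative inputs only: made-up per-token probabilities.
predictable_text = [0.9, 0.8, 0.95]
surprising_text = [0.1, 0.05, 0.2]
print(looks_synthetic(predictable_text, threshold=-1.0))
print(looks_synthetic(surprising_text, threshold=-1.0))
```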
- Asia > Singapore (0.04)
- Africa > Middle East > Tunisia (0.04)
- North America > United States > California > Los Angeles County > Los Angeles (0.04)
- Law (1.00)
- Media > News (0.93)
- Government > Regional Government > North America Government > United States Government (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Segmentation-free Connectionist Temporal Classification loss based OCR Model for Text Captcha Classification
Khatavkar, Vaibhav, Velankar, Makarand, Petkar, Sneha
Captchas are widely used to secure systems from automated responses by distinguishing computer responses from human responses. Text, audio, video, and picture-based Optical Character Recognition (OCR) are used for creating captchas. Text-based OCR captchas are the most commonly used, and they pose problems of complex and distorted content. There have been attempts to build captcha detection and classification systems using machine learning and neural networks, which need to be tuned for accuracy. Existing systems face challenges in recognizing distorted characters, handling variable-length captchas, and finding sequential dependencies in captchas. In this work, we propose a segmentation-free OCR model for text captcha classification based on the connectionist temporal classification loss technique. The proposed model is trained and tested on a publicly available captcha dataset. The proposed model gives 99.80% character-level accuracy and 95% word-level accuracy. The accuracy of the proposed model is compared with state-of-the-art models and proves to be effective. Variable-length, complex captchas with sequential dependencies can thus be processed with the segmentation-free connectionist temporal classification loss technique, which will be widely used in securing software systems.
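To illustrate why a connectionist temporal classification (CTC) model needs no character segmentation, here is a minimal sketch of greedy CTC decoding: the model emits one score vector per time step, and the decoder collapses repeated labels and drops blanks to recover a variable-length string. This is a generic illustration, not the authors' model. The blank index, the alphabet, and the per-frame scores are all assumptions made for the example.

```python
from typing import List, Sequence

BLANK = 0  # assumption: index 0 is the CTC blank label


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Collapse per-frame predictions into a string.

    frame_scores[t][k] is the score of label k at time step t, where
    label 0 is the blank and label k >= 1 maps to alphabet[k - 1].
    """
    # Pick the highest-scoring label at every time step.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]

    decoded: List[str] = []
    previous = BLANK
    for label in best:
        # Emit a character only when the label is not blank and differs
        # from the previous frame; a blank between two identical labels
        # lets a genuinely doubled character survive.
        if label != BLANK and label != previous:
            decoded.append(alphabet[label - 1])
        previous = label
    return "".join(decoded)


# Illustrative scores for six frames over the labels (blank, 'a', 'b').
frames = [
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.8, 0.1],    # a (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # a (new character after blank)
    [0.1, 0.1, 0.8],    # b
    [0.1, 0.1, 0.8],    # b (repeat, collapsed)
]
print(ctc_greedy_decode(frames, alphabet="ab"))  # prints "aab"
```

Because the output length is set by the collapse step rather than by a fixed number of character slots, the same decoder handles captchas of varying length.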
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Germany > Berlin (0.04)
IoMT Technology Automates Vital Signs Measurement
No one likes to schedule a medical appointment only to find an endless wait at a crowded doctor's office or clinic. But with a critical lack of healthcare workers, those waits aren't getting any shorter. The good news is IoMT (Internet of Medical Things) technology is helping take the pressure off overburdened staff. Self-service kiosks, powered by AI, can deliver a better patient experience--both in and out of the clinical setting. The shortage of medical workers may be new in some parts of the world, but it's a familiar problem in other markets.
- Asia > Thailand > Chiang Rai > Chiang Rai (0.05)
- Asia > Taiwan (0.05)
New facial recognition technology caught 'imposter' using someone else's passport, US officials say
A new facial recognition technology caught a man trying to enter the US using a passport belonging to someone else, US officials say. Officials with US Customs and Border Protection (CBP) and the Office of Field Operations (OFO) intercepted a 26-year-old man, whom the agencies referred to as an "imposter", at Washington's Dulles International Airport after he reportedly attempted to use a French passport belonging to someone else. The man was travelling to the US from Brazil. "The officer utilised CBP's new facial comparison biometric technology which confirmed the man was not a match to the passport he presented," the CBP press release read. It added: "A search revealed the man's authentic Republic of Congo identification card concealed in his shoe."
- Asia > South Korea (0.48)
- Asia > North Korea (0.28)
- South America > Brazil (0.24)
- Transportation > Air (1.00)
- Leisure & Entertainment > Sports (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
Sarah Jeong: New York Times journalist who tweeted 'cancel white people' is victim of 'dishonest' trolls, claims former employer
Sarah Jeong, a technology journalist hired by the New York Times and vilified online for tweets comparing "dumbass f****** white people" to dogs and saying they would "all go extinct soon", has been targeted for harassment by dishonest trolls, her former employer has claimed. Editors at The Verge, an online tech magazine, denounced what they called "disingenuous" criticism of Ms Jeong by "people acting in bad faith". The senior writer had been the victim of a Gamergate-style campaign designed to "divide and conquer by forcing newsrooms to disavow their colleagues", they suggested. Ms Jeong, 30, posted a string of offensive and apparently racist messages including "#CancelWhitePeople" and "white men are bulls***" up to five years ago. After being uncovered they quickly spread and were picked up by conservative media including the Daily Caller and Gateway Pundit websites.
- Asia > North Korea (0.47)
- Asia > Russia (0.05)
- Europe > Croatia (0.05)
- Media > News (1.00)
- Law (1.00)
- Leisure & Entertainment > Sports > Soccer (0.97)